First we need to import some tools from scikit-learn and the classifiers we're going to use.
In [1]:
from __future__ import division, print_function
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
import numpy as np
import time
import matplotlib.pyplot as plt
%matplotlib nbagg
In [2]:
def plot_roc(fpr, tpr):
    """
    Simple ROC curve plotting function.

    Parameters
    ----------
    fpr : array
        False positive rate
    tpr : array
        True positive rate
    """
    plt.plot(fpr, tpr, lw=1.5)
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
Scikit-learn has several built-in datasets as well as functions to generate random data. We'll use one of these for this example, but it's straightforward to substitute your own data.
In [3]:
# X is the array of features, y is the array of corresponding class labels
X, y = make_circles(n_samples=1000, noise=0.1, random_state=0)
To avoid overfitting, we split the data into a training set, used to train the algorithm, and a test set, used to evaluate its performance. There's no hard and fast rule about how big your training set should be, as this is highly problem-dependent. Here, we'll use 70% of the data as training data.
In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
You should always rescale your features, as many algorithms (including SVM and many neural network implementations) assume the features have zero mean and unit variance and will likely underperform without scaling. In this example the generated data are already scaled, so it's not strictly necessary, but I leave it in to show you how it's done.
In [5]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Now we can have a look at our training data, where I've coloured the points by the class they belong to.
In [6]:
plt.figure()
plt.plot(X_train[y_train==0,0], X_train[y_train==0,1],'.')
plt.plot(X_train[y_train==1,0], X_train[y_train==1,1],'.')
plt.legend(('Class 0', 'Class 1'))
Out[6]:
Let's start classifying! Scikit-learn is fully object-oriented so each classifier is an object. While there are dozens of different algorithms available, they all behave in the same way, implementing the same functions, meaning it's very easy to swap classifiers in and out.
Here we create a K-nearest neighbours classifier object, which has several hyperparameters that are set to default values.
In [7]:
clf = KNeighborsClassifier()
print(clf)
This is the function that actually trains the classifier with our training data.
In [8]:
clf.fit(X_train, y_train)
Out[8]:
Now that the classifier is trained, we can use it to predict the classes of our test data and have a look at the accuracy.
In [9]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
But accuracy is only part of the story, so let's extract the probability of each test point belonging to each class so that we can generate the ROC curve.
In [10]:
probs = clf.predict_proba(X_test)
In [11]:
fpr, tpr, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc)
In [12]:
plt.figure()
plot_roc(fpr, tpr)
The K-nearest neighbours classifier has several hyperparameters that we've just left at their defaults. If we optimise these instead, we get a better result. The most robust way to do this is with one of scikit-learn's cross-validation methods; here we'll use GridSearchCV. Naturally, the algorithm will now take much longer to train, as it has to be trained and evaluated once per parameter combination and fold.
In [13]:
t1 = time.time()
clf = KNeighborsClassifier()
# Define a grid of parameters over which to search, as a dictionary
params = {'n_neighbors':np.arange(1, 30, 1), 'weights':['distance', 'uniform']}
# cv=5 means we're doing 5-fold cross validation.
clf = GridSearchCV(clf, params, cv=5)
clf.fit(X_train, y_train)
print('Time taken',time.time()-t1,'seconds')
In [14]:
# We can see which combination of parameters was best
print(clf.best_params_)
Let's see if the accuracy has improved.
In [15]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
The accuracy is more or less unchanged, but we do get a slightly better ROC curve.
In [16]:
probs = clf.predict_proba(X_test)
In [17]:
fpr_knn, tpr_knn, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc_knn = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc_knn)
In [18]:
plt.figure()
plot_roc(fpr_knn, tpr_knn)
Scikit-learn is designed to let you easily swap algorithms in and out.
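As a quick illustration of the shared interface, here's a minimal sketch that trains and scores several classifiers with exactly the same calls; LogisticRegression and GaussianNB are just arbitrary extra examples, not algorithms used elsewhere in this notebook. Below, the same swap is then done in full with an SVM.
# A sketch of the shared interface: every classifier is trained and scored with
# the same calls. LogisticRegression and GaussianNB are arbitrary extra examples.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

for candidate in (KNeighborsClassifier(), SVC(probability=True),
                  LogisticRegression(), GaussianNB()):
    candidate.fit(X_train, y_train)
    print(candidate.__class__.__name__, accuracy_score(y_test, candidate.predict(X_test)))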
In [19]:
clf = SVC(kernel='rbf', probability=True)
clf.fit(X_train, y_train)
Out[19]:
In [20]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
In [21]:
probs = clf.predict_proba(X_test)
fpr_svm, tpr_svm, thresh = roc_curve(y_test, probs[:,1], pos_label=1)
auc_svm = roc_auc_score(y_test, probs[:,1])
print('Area under curve', auc_svm)
In [22]:
plt.figure()
plot_roc(fpr_knn, tpr_knn)
plot_roc(fpr_svm, tpr_svm)
plt.legend(('KNN (%.3f)' %auc_knn, 'SVM (%.3f)' %auc_svm), loc='lower right')
Out[22]:
You can do cross validation to tune the hyperparameters for SVM, but it takes a bit longer than for KNN.
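If you want to try it, a minimal sketch might look like this; the grid of C and gamma values is just an illustrative guess rather than a tuned choice.
# A sketch of tuning the SVM hyperparameters with GridSearchCV.
# The C and gamma values below are illustrative guesses, not recommendations;
# probability=True is only needed if you want predict_proba for a ROC curve.
t1 = time.time()
svm_params = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
svm_cv = GridSearchCV(SVC(kernel='rbf', probability=True), svm_params, cv=5)
svm_cv.fit(X_train, y_train)
print('Best parameters', svm_cv.best_params_)
print('Time taken', time.time() - t1, 'seconds')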
This is a somewhat contrived example to demonstrate over-fitting.
First we make some fake data with a few outliers in it.
In [23]:
np.random.seed(42)
x = np.linspace(-3,3, 100)
y = np.sin(x) + np.random.randn(len(x))*0.05
N = 25
outlier_ints = np.random.randint(0, len(x), N)
y[outlier_ints] += np.random.randn(N)*1
plt.figure()
plt.plot(x,y,'.')
plt.xlabel('x');
plt.ylabel('y');
In [24]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
y_train = y_train[np.argsort(X_train)]
X_train.sort()
y_test = y_test[np.argsort(X_test)]
X_test.sort()
X_train = X_train[:, None] # sklearn doesn't like 1d X arrays
X_test = X_test[:, None]
dt1 = DecisionTreeRegressor(max_depth=10) # An overly complicated regressor
dt2 = DecisionTreeRegressor(max_depth=3) # A simpler regressor
dt1.fit(X_train, y_train)
dt2.fit(X_train, y_train)
y_train_1 = dt1.predict(X_train)
y_train_2 = dt2.predict(X_train)
y_test_1 = dt1.predict(X_test)
y_test_2 = dt2.predict(X_test)
plt.figure()
plt.plot(x,y,'.')
plt.plot(X_test, y_test_1, lw=1.5, alpha=0.5)
plt.plot(X_test,y_test_2, lw=1.5, alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.legend(('Data', 'Max depth 10', 'Max depth 3'));
You can see the more complicated decision tree learns the behaviour of the spurious outliers. It's easy to check for over-fitting: whatever metric you're using (in this case mean squared error) will show much better performance on the training set than on the test set.
In [25]:
mse_train = np.mean((y_train-y_train_1)**2)
mse_test = np.mean((y_test-y_test_1)**2)
mse_train, mse_test
Out[25]:
In [26]:
mse_train = np.mean((y_train-y_train_2)**2)
mse_test = np.mean((y_test-y_test_2)**2)
mse_train, mse_test
Out[26]:
We'll now use cross-validation to choose the hyperparameters automatically and avoid over-fitting.
In [28]:
dt3 = GridSearchCV(DecisionTreeRegressor(), param_grid={'max_depth': np.arange(2,12)}, cv=5)
dt3.fit(X_train, y_train)
y_train_3 = dt3.predict(X_train)
y_test_3 = dt3.predict(X_test)
mse_train = np.mean((y_train-y_train_3)**2)
mse_test = np.mean((y_test-y_test_3)**2)
mse_train, mse_test
Out[28]:
In [29]:
print(dt3.best_params_)
Why does it choose max_depth=2 when we found max_depth=3 gave a better MSE? Because Decision Trees are high-variance: they're very sensitive to the exact choice of training and validation sets, so our previous result isn't general. This highlights why you need cross-validation, and why Decision Trees on their own aren't very good.
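You can see that sensitivity directly by repeating the split and the grid search with a few different random seeds; here's a rough sketch (the random_state values are arbitrary).
# A rough sketch of the sensitivity described above: refit the grid search on a
# few different train/test splits (arbitrary seeds) and watch the selected
# max_depth move around.
for seed in (0, 1, 2, 3):
    Xtr, Xte, ytr, yte = train_test_split(x[:, None], y, test_size=0.25, random_state=seed)
    search = GridSearchCV(DecisionTreeRegressor(), param_grid={'max_depth': np.arange(2, 12)}, cv=5)
    search.fit(Xtr, ytr)
    print('random_state', seed, '-> best max_depth:', search.best_params_['max_depth'])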
The overly-complicated model overfits the data, incorporating noise into the model. The easiest way to see if you're overfitting is to compare whatever metric you're using (e.g. mean-square-error or AUC) between training and test sets. Usually these numbers should be fairly similar. If performance is significantly higher in the training set, you're overfitting.
To avoid overfitting, use cross-validation to choose your hyperparameters and prefer the simplest model that still describes the data well.
Note: the above numbers are sensitive to the random seed and the exact dataset used (change them and see), but on average you'll find that the decision tree with the higher max depth tends to overfit.
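If you want to check the "on average" part, here's a rough sketch that averages the train and test MSE over several random splits; the seeds are arbitrary.
# A rough check of the 'on average' claim: average train and test MSE over a few
# random splits (arbitrary seeds) for the deep and the shallow tree.
results = {10: [], 3: []}
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(x[:, None], y, test_size=0.25, random_state=seed)
    for depth in (10, 3):
        tree = DecisionTreeRegressor(max_depth=depth).fit(Xtr, ytr)
        mse_tr = np.mean((ytr - tree.predict(Xtr))**2)
        mse_te = np.mean((yte - tree.predict(Xte))**2)
        results[depth].append((mse_tr, mse_te))
for depth in (10, 3):
    mean_tr, mean_te = np.mean(results[depth], axis=0)
    print('max_depth=%d: mean train MSE %.3f, mean test MSE %.3f' % (depth, mean_tr, mean_te))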
You can try the same thing with a different algorithm (just look up a few in the scikit-learn documentation), have a look at a regression problem, which works in a similar way, or, for some real fun, try out a different dataset!
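For example, here's a minimal sketch swapping in a random forest on a fresh copy of the circles data; RandomForestClassifier and its settings are just one arbitrary choice.
# A sketch of trying another algorithm end to end on a fresh copy of the data.
# RandomForestClassifier is an arbitrary choice; any classifier with
# predict_proba slots into the same fit/predict/ROC workflow.
from sklearn.ensemble import RandomForestClassifier

X, y = make_circles(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
probs = rf.predict_proba(X_test)
print('Accuracy', accuracy_score(y_test, rf.predict(X_test)))
print('Area under curve', roc_auc_score(y_test, probs[:, 1]))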